Rom Hacking 201: A non-ASM approach to compressed data
Notes on きかんしゃトーマス ソドーとうのなかまたち for GameBoy Color
Last Updated: 2022-05-08, abridgewater

* Preface

Prompted by TF1945 asking for and receiving advice on how to translate this game, Bunkai began producing how-to documents for parts of the romhacking process based on the Discord chat logs. It turns out that some of the game's graphics, particularly the title screen graphics, are compressed, so I thought I'd try to document a "differential analysis" attack on the compression scheme, where I start with the ROM and the decompressed data from VRAM, find the compressed data in the ROM, and determine what I can of its encoding (those familiar with cryptography would probably consider this to be a "known plaintext attack").

What good is working out the compression scheme this way? You could use the information to encode new data by hand with a hex editor, but that would be incredibly tedious. A programmer could write standalone decompression and compression programs without needing to reverse engineer the decompression routine used by the game itself (meaning that your hypothetical programmer does not need to be able to do assembly hacking). And knowing the basics of the compression scheme would make it easier for an assembly hacker to figure out how the decompression routine works.

* Acknowledgements

TF1945 for prompting the entire exercise.
Bunkai for deciding to use TF1945's questions and the community response as a teaching moment, and for feedback.
256, 343, and YasaSheep for feedback and for finding out more about the compression scheme.
Bootleg Porygon for feedback.
Lusoneko, Pluvius, and Calico for contributing to the series and project more generally.

And apologies to anyone that I forgot. Mea culpa.

* Investigating the Title Screen

Step one, find our decompressed data. Start the ROM in an emulator that has debugging support, get to the title screen, and pause the emulation. Next, use the tile viewer to find the address of some of the decompressed tiles in VRAM, then point a memory viewer at that address to see the decompressed data bytes.

#+begin_src fundamental
9000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
9010: 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF
9020: 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF 01 FE
9030: 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF 38 C7 FE 01
9040: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
9050: 00 FF 00 FF 00 FF 00 FF 00 FF 0E F1 3F C0 7F 80
9060: 07 F8 0F F0 1F E0 1F E0 0F F0 03 FC 00 FF 01 FE
9070: FF 00 FF 00 FF 00 FE 01 F8 07 80 7F 00 FF E0 1F
9080: 01 FE 03 FC 07 F8 07 F8 07 F8 03 FC 03 FC 01 FE
9090: C3 3C EF 10 FF 00 FF 00 FF 00 FF 00 FF 00 FF 00
90A0: C0 3F F8 07 FC 03 FF 00 FF 00 FF 00 FF 00 FF 00
#+end_src

That should be enough data to start with. Now, this is a very regular pattern in the beginning, so we'll skip ahead to a significant break in the pattern: at $903c we have the sequence "38 c7 fe 01", the first few bytes that aren't all-bits-set or all-bits-clear or very close to it.

Next, we look through the ROM to try and find this sequence. I start by looking for "38 c7", and if that fails then I'll look for "c7 fe". These are deliberately small in order to increase the chance of finding a match, at the cost of having a number of "false positive" matches to weed out. My tools find twenty-two occurrences of "38 c7" in the ROM, though (due to tool limitations) they won't find any occurrences that straddle an 8-byte boundary.
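As an aside, a from-scratch search is only a few lines of script and has no such alignment limitation. This is a throwaway sketch of the idea, not part of any particular tool, and the ROM file name is only a placeholder:

#+begin_src python
import sys

# Throwaway sketch: print every offset of a short byte pattern in the ROM,
# with a little context around each hit.  Pass the ROM image's file name
# on the command line (the default name here is just a placeholder).
rom = open(sys.argv[1] if len(sys.argv) > 1 else "rom.gbc", "rb").read()
pattern = bytes.fromhex("38 c7")

offset = rom.find(pattern)
while offset != -1:
    context = rom[max(offset - 8, 0):offset + 8]
    print(f"{offset:06x}: {context.hex(' ')}")
    offset = rom.find(pattern, offset + 1)
#+end_src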
Twenty-two hits is not too long of a list, and if I had an overview of what's where in the ROM from other analysis, I could use that to eliminate candidates.

First match, $1102b.

#+begin_src fundamental
00011020 00 01 00 01 00 01 00 f8 07 c0 3f 38 c7 7c 83 7c |..........?8.|.||
#+end_src

Not it. Next match, $1111b.

#+begin_src fundamental
00011110 00 01 00 01 00 01 00 f8 07 c0 3f 38 c7 7c 83 7c |..........?8.|.||
#+end_src

Not it. Next match, $14128.

#+begin_src fundamental
00014120 00 01 00 01 01 00 00 ff 38 c7 27 c0 20 c0 20 c0 |........8.'. . .|
#+end_src

Not it. Next match, $21169.

#+begin_src fundamental
00021160 83 22 7e 8a 22 7e 88 22 21 38 c7 06 00 2a 5f 2a |."~."~."!8...*_*|
#+end_src

Not it. Next match, $4c012.

#+begin_src fundamental
0004c010 09 9f 38 c7 fe 01 ff 00 0c 3f 07 0e ff f1 3f c0 |..8......?....?.|
#+end_src

Jackpot! And we note that the pattern from VRAM $905a, "0e f1 3f c0", appears here at $4c01b as "0e ff f1 3f c0", further confirming that this is the data that we're looking for. That "ff" is a bit suspicious, but we'll get to that.

We'll just grab a bit of this data for deeper analysis.

#+begin_src fundamental
0004bff0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 00 |................|
0004c000 00 08 34 05 65 00 00 0d ff 01 0f 13 07 01 fe 1f |..4.e...........|
0004c010 09 9f 38 c7 fe 01 ff 00 0c 3f 07 0e ff f1 3f c0 |..8......?....?.|
0004c020 7f 80 07 f8 0f ff f0 1f e0 1f e0 0f f0 03 f9 fc |................|
0004c030 3f 01 5e 03 fe 01 f8 07 80 ff 7f 00 ff e0 1f 01 |?.^.............|
0004c040 fe 03 f7 fc 07 f8 01 01 03 fc 03 fc bf 01 fe c3 |................|
0004c050 3c ef 10 82 09 c0 9f 3f f8 07 fc 03 94 07 9f 05 |<......?........|
0004c060 80 f1 7f 01 01 91 01 3f 03 0f f0 0c f3 ff 0c f3 |.......?........|
0004c070 0e f1 fe 01 fc 03 ff e0 1f c0 3f 00 ff 0c f3 3f |..........?....?|
#+end_src

At this point we have a chunk of decompressed data and a couple of places where the decompressed data matches up with data in the ROM, meaning that we've found approximately where the compressed data sits. We don't yet know any details of the compression scheme other than that it includes literal data at times, and we don't know where the compressed data starts or where it ends.

To find either end of the compressed data, we want to look forward or backwards through the ROM until we find the compressed data that corresponds with the start or the end of the decompressed data. This, in turn, requires that we work out at least the basics of the overall compression scheme. And since the compressed data includes literal data, what we do is line it up with the decompressed data and look for patterns: what's in the compressed data that's not in the decompressed data, and what's in the decompressed data that's not in the compressed data?

Our starting point is the first of the matched runs, at $4c012, and we work our way forward from there.

VRAM $903c: 38 c7 fe 01 ff
ROM $4c012: 38 c7 fe 01 ff

VRAM $9041: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
VRAM $9050: 00 ff 00 ff 00 ff 00 ff 00 ff
ROM $4c017: 00 0c 3f 07

VRAM $905a: 0e f1 3f c0 7f 80 07 f8 0f f0 1f e0 1f e0 0f f0 03
ROM $4c01b: 0e ff f1 3f c0 7f 80 07 f8 0f ff f0 1f e0 1f e0 0f f0 03

There's a pattern here. The compressed data has an "ff" byte followed by eight bytes of literal data, more than once.
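Lining the dumps up by eye works, but it gets tedious quickly. A small throwaway helper along these lines (a sketch; it only has to cope with the two dump formats shown above) makes it possible to pull slices of VRAM and ROM into a script and compare them programmatically:

#+begin_src python
def parse_hexdump(text):
    """Turn a pasted hex dump back into bytes, accepting either the
    "9000: 00 FF ..." memory viewer style or the "0004c010 09 9f ... |..|"
    style used above."""
    data = bytearray()
    for line in text.strip().splitlines():
        line = line.split("|")[0]        # drop the ASCII column, if present
        for token in line.split()[1:]:   # the first token is the address
            data.append(int(token, 16))
    return bytes(data)

# Example: the two VRAM rows at $9030-$904f from the dump above.
vram_9030 = parse_hexdump("""
9030: 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF 38 C7 FE 01
9040: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
""")
#+end_src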
Continuing the comparison:

VRAM $906b: fc 00 ff 01 fe ff 00 ff 00 ff 00 fe 01 f8 07 80
ROM $4c02e: f9 fc 3f 01 5e 03 fe 01 f8 07 80

VRAM $907b: 7f 00 ff e0 1f 01 fe 03
ROM $4c039: ff 7f 00 ff e0 1f 01 fe 03

VRAM $9083: fc 07 f8 07 f8 07 f8 03 fc 03 fc 01 fe c3
ROM $4c042: f7 fc 07 f8 01 01 03 fc 03 fc bf 01 fe c3

A break from pattern at $4c02e, a return to pattern at $4c039, and a break again at $4c042... but there are clearly still literal matches here mixed in with runs of non-literal data. What's the larger pattern?

It appears that we have a control byte of some sort that affects how the following data is interpreted. For every bit set in the control byte, there is a byte of literal data following. The clear bits in the control byte signify some other operation, one that produces more decompressed data than there is corresponding compressed data. Taking this insight a step further, it seems that control bits are consumed LSB-first (note the $f9 at ROM $4c02e, followed by a single literal byte, two of something else, and then five more literal bytes), and that each clear control bit corresponds to two bytes in the compressed data.

So, what do these two-byte sequences mean? Well, whenever we see one we can see that it corresponds to a repetition of something from earlier in the decompressed data, making it a backreference, and the entire compression scheme one of the many Lempel-Ziv variants. Overall, this means that the bytes have to encode a /length/ (how many bytes to copy) and a /distance/ (how far back in the decompressed data to start copying). The number of bits used for length and for distance varies from compression scheme to compression scheme, and determines how long a length can be encoded and how far back in the decompressed data a reference can reach. In this case, however, it looks like a nice, easy-to-handle eight bits each. [Spoiler: It isn't. See below... or the next document.]

Just from looking at the examples we have above, it appears that the first byte is the position, and is /one less/ than the (backwards!) offset from which to start copying; and the second byte is the length, and is /three less/ than the number of bytes to copy.

Why is the position value one less than the offset? Because a zero offset would be the /next/ byte to output, which doesn't make any kind of sense, and biasing the position gives that little extra bit of range for a backreference.

Why is the length value three less than the number of bytes to copy? A backreference costs seventeen bits to encode (one control bit to signify the backreference, two bytes to encode the position and length), while literal bytes cost nine bits each (one control bit to signify the literal, one byte for the literal). A length of zero or one costs more to encode than using literals (zero or nine bits vs. seventeen bits). A length of two is a very slight savings compared to using literals (eighteen bits vs. seventeen bits). And the same argument for that little extra bit of range applies as for the position.

So that's the basics of the scheme. There are still two open questions. First, how does the decompressor know when to stop decompressing? Second, where does the compressed data start?

There are two ways for the decompressor to know when to stop. Either it knows how long the decompressed data will be and stops when it fills that space, or there is a special end marker that it looks for in the compressed data.
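Before tackling those, it helps to pin the hypothesis down as code. Below is a minimal decoder sketch of the scheme as understood at this point, not the game's actual routine: the stop condition is assumed to be a caller-supplied output length, and (per the spoiler above) the backreference encoding described here turns out to be subtly wrong, with a corrected decode given in the retrospective below.

#+begin_src python
def decompress(rom, offset, out_len):
    """Sketch of the hypothesized scheme: control bits consumed LSB-first,
    set bits are literal bytes, clear bits are two-byte backreferences
    encoded as (distance - 1, length - 3).  The stop condition (out_len)
    is an assumption."""
    out = bytearray()
    while len(out) < out_len:
        control = rom[offset]
        offset += 1
        for bit in range(8):                  # control bits, LSB first
            if len(out) >= out_len:
                break
            if control & (1 << bit):          # set bit: one literal byte
                out.append(rom[offset])
                offset += 1
            else:                             # clear bit: backreference
                distance = rom[offset] + 1    # first byte is distance - 1
                length = rom[offset + 1] + 3  # second byte is length - 3
                offset += 2
                for _ in range(length):       # byte-wise copy so overlaps repeat
                    out.append(out[-distance])
    return bytes(out)
#+end_src

Fed the ROM data starting where the compressed stream turns out to begin (see below), and an assumed output length, a decoder along these lines reproduces the VRAM dump above, at least until a backreference needing more than eight bits of distance comes along.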
The end marker is the more likely of the two, and the only space in the encoding scheme where it can fit is as a backreference with a particular value. This is an easy enough thing to determine: walk forward through the compressed data until you reach the end. If it just stops without any unusual backreference, it's length-limited, and the length needs to be found. If there /is/ an unusual backreference, then that's the end marker. Ultimately, I'm not too fussed about this, so I'll leave it for someone else to find. [Spoiler: It turned out to be a length limit.]

As far as finding the start of the compressed data goes, we had a solid match between "38 c7 fe 01" at ROM $4c012 and the same bytes at VRAM $903c. The immediately preceding ROM byte is "9f", which would be a control byte that calls for five literal bytes, two backreferences, and a further literal, which is precisely the pattern that we found leading up to the blocks of eight literal bytes. This puts VRAM $903c and ROM $4c011 at the point of needing to fetch a control byte:

#+begin_src fundamental
9000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
9010: 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF
9020: 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF 01 FE
9030: 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF 38 C7 FE 01
#+end_src

#+begin_src fundamental
0004bff0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 00 |................|
0004c000 00 08 34 05 65 00 00 0d ff 01 0f 13 07 01 fe 1f |..4.e...........|
0004c010 09 9f 38 c7 fe 01 ff 00 0c 3f 07 0e ff f1 3f c0 |..8......?....?.|
#+end_src

So, we work our way backwards. The preceding byte in VRAM is "ff". The preceding byte in ROM is "09". These are not the same, so we're definitely looking at a backreference, "1f 09". This is twelve ($0c) bytes copied, bringing us to VRAM $9030 and ROM $4c00f.

Next, we have two bytes that are the same in both VRAM and ROM, so probably literal bytes, bringing us to VRAM $902e and ROM $4c00d. The preceding byte in VRAM is again "ff", while ROM is "07", signifying a backreference "13 07". This is ten ($0a) bytes copied, bringing us to VRAM $9024 and ROM $4c00b. At this point we've seen a backreference, two literal bytes, and a backreference, meaning that the eventual control byte that we find will have an upper nybble of 0110 binary. There are four more items to find before that control byte.

Continuing on, the preceding VRAM byte is again "ff", while the preceding ROM byte is "0f", so we have a backreference "01 0f". This is a length of eighteen ($12) bytes, putting us at VRAM $9012 and ROM $4c009. And here is where I missed a trick in my initial analysis: the VRAM contents covered by /both/ of these backreferences are one repeated run of "00 ff", with a total length easily represented in eight bits, so why is it /two/ backreferences? Clearly, the backreference length is limited to four bits (three to eighteen bytes of length), and the other four bits in that second byte are used for something else. Within the context of an LZ compression scheme, this means either an escape such as end-of-data or additional position bits. [Spoiler: it turned out to be additional position bits. See the next document.]

VRAM $9011 is "ff", ROM $4c008 is "ff". This strongly suggests a literal byte. VRAM $9010 is "00", ROM $4c007 is "0d". This would be "00 0d", a sixteen ($10) byte backreference, bringing us to VRAM $9001 and ROM $4c006. And we've now seen seven items (a backreference, two literals, two backreferences, a literal, and a backreference), so there's just the one item left before a control byte.
And it's a "00" in both VRAM and ROM, meaning a probable literal. This brings us to the start of the VRAM area that we were looking at, and we should expect to (and do!) see a control byte of "65" at ROM $4c004.

The compressed data stream thus starts at ROM $4c004. Four bytes into a 16k bank is an unusual starting point. There /are/ cases where it is a natural boundary, but they involve specific banks (typically the first or last bank of a ROM) and specific CPU semantics (fixed interrupt or cold-start/reset entry points). While some of these /do/ apply to the GameBoy CPU, they /do not/ apply to bank $13 (which is where the compressed data is stored), so we're probably looking at a header. The meaning of the probable header, "00 08 34 05", isn't immediately obvious, though the least uncomfortable interpretation is as two 16-bit words, $0800 and $0534, one or both of which might be some sort of flags or compression type, or could be a length.

* Retrospective: What Was Found Since

Within a day or so of writing the initial version of this document, 256 informed me of a few additional things, and then YasaSheep took things even further, both using debugger and disassembly techniques (what I consider to be "dynamic" code analysis). So, what did they find, and how could it have been found without having to break out the disassembler?

Let's start with the information from 256:

- The four-byte header before the compressed data stream consists of a 16-bit offset relative to the load address passed to the decompression routine and the length of the compressed data stream.

Taking the easy one first, the length of the compressed data stream is how the decompression routine determines when to stop, and while its being determined by the /compressed/ data length would have been a mild surprise, it would have been discovered by the process outlined for finding the terminator. As said towards the end of the previous section, "it's length-limited, and the length needs to be found".

The offset, in this case $0800, is trickier. With just the one block of compressed data, it would be meaningless. With multiple compressed data blocks there might be a pattern to it that would make it obvious. Or a corruption attack might reveal something. In both cases there's a risk of missing a significant contributing factor (about which more below). There's also a potential confusion for the particular compressed data block under analysis: the /decompressed/ length appears to be $0800 bytes, so mistaking the offset for a length is possible. This would have been revealed via a corruption attack or by analyzing a second block of compressed data (one with a decompressed length that differs from its offset). Alternatively, maybe it /is/ the length of the decompressed data after all?

- There is a table of compressed data blocks to load, and where to load them, per "state" of some sort of state machine, comprising a 16-bit source address, an 8-bit source ROM bank, a 16-bit target address, and an 8-bit target VRAM bank, repeated for however many compressed blocks, and terminated by three $ff bytes.

The specific address given for one of these tables was $00:1360, for the title screen. ROM file offset $4c000 is CPU address $4000 in bank $13, or $13:4000. This is a /lousy/ address to try and find by searching. That said, it is good as additional validation that we have found the correct table.
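To make the table layout concrete, parsing it might look something like the sketch below. The field order is as 256 described it; the little-endian 16-bit words and the bank-to-file-offset arithmetic are my assumptions, not something verified against the game's code:

#+begin_src python
import struct

def read_load_table(rom, table_offset):
    """Walk a load table of the shape 256 described: repeated entries of
    (16-bit source address, 8-bit source ROM bank, 16-bit target address,
    8-bit target VRAM bank), terminated by three $ff bytes."""
    entries = []
    offset = table_offset
    while rom[offset:offset + 3] != b"\xff\xff\xff":
        src_addr, src_bank, dst_addr, dst_bank = struct.unpack_from("<HBHB", rom, offset)
        entries.append({
            "src_bank": src_bank,
            "src_addr": src_addr,
            # assumes the source sits in the switchable $4000-$7fff window
            "src_file_offset": src_bank * 0x4000 + (src_addr - 0x4000),
            "dst_bank": dst_bank,
            "dst_addr": dst_addr,
        })
        offset += 6
    return entries

# The title screen table was reported at CPU $00:1360, which in bank 0 is
# simply ROM file offset $1360:
#   read_load_table(rom, 0x1360)
#+end_src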
A quick check in the tile viewer shows three or four distinct regions loaded as part of the title screen, two of which are tiles and two of which are not (almost certainly the tile /map/). Finding the compressed data for even one of these, like we did for the initial tile area, would raise our odds of finding the table considerably. Searching for the CPU address of one of the compressed blocks and finding its bank ID, the CPU addresses of the other blocks, and the bank IDs of those other blocks in proximity would nail down the location of the table. The terminator of three $ff bytes is at that point obvious. From there we see that there are three bytes unaccounted for associated with each compressed block, and their values would strongly suggest VRAM addresses. Corrupting these would serve as a final confirmation.

Matching the VRAM addresses in the table to the actual loaded addresses for the data shows general agreement. This includes the first compressed data block, the one that we examined above, which suggests that the first word of the compressed data header is /not/ some kind of address offset after all.

A second angle on all this is predicated on having already found the end of the first compressed data block: data that is used together or is of the same kind is often stored together, which means that data stored together /can/ be (but is not necessarily) related by more than mere proximity. And it turns out that the next three chunks in the ROM file are the other three compressed data blocks for the title screen, so we might have found the addresses for the table that way.

256 also provided the address of the second compressed data block, the address of the decompression routine, and the register and machine state requirements for calling the decompression routine. Clearly, we could have found the compressed data block, but the information about the decompression routine cannot be found without some form of code analysis.

YasaSheep went and found most of the same things that 256 did, but also one more thing (actually the focus of their investigation):

- Backreferences are a 12-bit offset and a 4-bit length, with the top four bits of the offset stored in the upper four bits of what superficially appears to be the length byte.

This is a distinct difference from what I claimed previously. YasaSheep actually found it because an initial attempt at writing a decompressor didn't work out, presumably producing obviously incorrect output, which led to a code-level analysis of the in-game decompressor. Could this have been figured out from comparing the compressed and decompressed data? Yes. The time-consuming part is actually finding cases where it occurs. Once a place is found where a decompressor fails to match what the "reference implementation" (the game's decompressor) does, along with what the correct output and the compressed data are, differential analysis applies. The critical problem is realizing that there /is/ a disconnect between the model of how the compression scheme works and how the compression scheme actually works. Code analysis doesn't leave these ambiguities (what the code does is what the code does), but it does involve its own tools and techniques.
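Retrofitting YasaSheep's finding onto the decoder sketched earlier only changes the two-byte backreference decode. A hedged version (the one-less and three-less biases are carried over from the earlier analysis, not re-verified here) would be:

#+begin_src python
def decode_backreference(byte1, byte2):
    """Backreference fields per YasaSheep's finding: a 12-bit offset whose
    top four bits sit in the upper nybble of what superficially looks like
    the length byte, and a 4-bit length in the lower nybble.  The biases
    are assumed to match the earlier analysis."""
    distance = (((byte2 & 0xF0) << 4) | byte1) + 1  # 12-bit offset, biased by one
    length = (byte2 & 0x0F) + 3                     # 4-bit length: 3 to 18 bytes
    return distance, length
#+end_src

Notably, every backreference walked through above happens to have a zero upper nybble in its second byte, which is why the simpler eight-bits-each reading appeared to hold up.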
What are other ways to have found this error? Writing a compressor that encodes a run longer than eighteen bytes would trigger a similar failure, in that the decompressor would produce unexpected output. Finding a place in existing data that encodes such a run could also indicate that at least /something/ wasn't quite right, since it would be two or more backreferences for no immediately apparent reason, but it requires paying careful attention rather than moving quickly, which is why I missed the critical clue during my initial analysis.

Overall, this approach works well for Run-Length Encoding and Lempel-Ziv style byte-oriented compression. It might also work for dictionary compression such as is used for script text in RPGs and the like, though telling the dictionary apart from the compressed text might be an issue there. I don't know that it would work all that well with a lossy compression, or with a bit-stream compression such as certain Huffman compression schemes.

* EOF